Estimation of basis vectors
Input : M'×N matrix data A'
      : Number of desired basis vectors k
      : Threshold λ
      : Scaling offset value μ
Output: k basis vectors {b_i : i = 1, ..., k}

(A', k, λ, μ, b) {
    A' : (file) pointer to M'×N matrix
    R_m, R_s : M'-dimensional vectors
    r : N-dimensional vector (denoted R_i[j], but without the need to hold the full M'×N matrix)
    C : N×N matrix; // for holding the covariance matrix
    w, t : double;
    first : boolean;

    first = true;
    for (int p = 1; p <= k; p++) {
        if (!first) {
            for (int i = 1; i <= M'; i++) { // step 1: selective scaling
                t = ||R_i||; // obtain the length of each model vector
                if (|R_m[i]| > λ) { // inner product with basis vector is larger
                    if (R_m[i] > 0.0) w = (1 - R_m[i])^[exponent illegible];
                    else w = (1 + R_m[i])^[exponent illegible];
                    for (int j = 1; j <= N; j++) {
                        R_i[j] = R_i[j] * w; // scaling
                    }
                }
                else continue;
            }
        }
        C = (1/M') Σ_{i=1..M'} [summand illegible]; // step 2: calculate the covariance matrix
        b_p = EVD(C); // step 3: eigenvector for maximum eigenvalue
        MGS(b_p); // step 4: Modified Gram-Schmidt
        output(b_p); // output the p-th basis vector
        for (int i = 1; i <= M'; i++) { // step 5: compute the contribution matrix
            for (int j = 1; j <= N; j++) {
                R_m[i] += R_i[j] * b_p[j];
            }
            if (R_s[i] == 0.0) R_m[i] = 0.0;
            else R_m[i] = R_m[i] / R_s[i];
        }
        for (int i = 1; i <= M'; i++) { // step 6: compute the residual vector
            for (int j = 1; j <= N; j++) {
                R_i[j] = R_i[j] - R_m[i] × b_p[j];
            }
        }
        if (first) first = false;
    }
}
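The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure as filed: the covariance matrix is taken to be the usual mean-centred one, EVD is replaced by plain power iteration, the residual step subtracts the full projection onto the new basis vector, and selective scaling (step 1) is left out because its weight formula is not legible in the listing. The function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def covariance(rows):
    """Mean-centred N x N covariance matrix of M row vectors (assumed form of step 2)."""
    m, n = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / m for j in range(n)]
    C = [[0.0] * n for _ in range(n)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(n)]
        for a in range(n):
            for b in range(n):
                C[a][b] += d[a] * d[b] / m
    return C


def dominant_eigenvector(C, iters=1000):
    """Power iteration standing in for EVD (step 3): eigenvector of the largest eigenvalue."""
    n = len(C)
    v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-zero start vector
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [dot(row, v) for row in C]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # C maps v to zero; nothing left to extract
            break
        v = [x / norm for x in w]
    return v


def orthonormalize(b, basis):
    """Modified Gram-Schmidt (step 4): strip components along earlier basis vectors."""
    for q in basis:
        c = dot(b, q)
        b = [x - c * y for x, y in zip(b, q)]
    norm = math.sqrt(dot(b, b))
    return [x / norm for x in b]


def estimate_basis(rows, k):
    """Return k basis vectors, deflating the data after each one (steps 2 to 6)."""
    residual = [list(r) for r in rows]
    basis = []
    for _ in range(k):
        b = orthonormalize(dominant_eigenvector(covariance(residual)), basis)
        basis.append(b)
        deflated = []
        for r in residual:
            c = dot(r, b)  # contribution of this document along b (step 5)
            deflated.append([x - c * y for x, y in zip(r, b)])  # residual (step 6)
        residual = deflated
    return basis
```

Power iteration is used only to keep the sketch free of external libraries; any symmetric eigensolver would serve the same purpose. Asking for more basis vectors than the rank of the data leaves the later vectors arbitrary.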
==================== loop: r = 1

largest innerProduct = 0.9998399000041365
smallest innerProduct = -0.0021135015305200886

OK plus! doc = 94636 cnt = 1
OK plus! iP = 0.9998399000041365
(extrinsic) keywd = suzuki
(extrinsic) keywd = samurai
(extrinsic) keywd = japan

OK plus! doc = 50840 cnt = 2
OK plus! iP = 0.9743064944873534
(extrinsic) keywd = suzuki
(extrinsic) keywd = samurai
(extrinsic) keywd = sale

OK plus! doc = 92853 cnt = 3
OK plus! iP = 0.9372885374943962
(extrinsic) keywd = suzuki
(extrinsic) keywd = samurai
(extrinsic) keywd = sale

OK plus! doc = 68088 cnt = 4
OK plus! iP = 0.8239438960864272
(extrinsic) keywd = suzuki
(extrinsic) keywd = japan
(extrinsic) keywd = plan

OK plus! doc = 2733 cnt = 5
OK plus! iP = 0.7617615667012242
(extrinsic) keywd = suzuki
(extrinsic) keywd = samurai
(extrinsic) keywd = car
(intrinsic) keywd2 = sheriff

OK plus! doc = 108212 cnt = 6
OK plus! iP = 0.712350127090532
(extrinsic) keywd = suzuki
(extrinsic) keywd = samurai
(extrinsic) keywd = car
(intrinsic) keywd2 = asher

OK plus! doc = 79412 cnt = 7
OK plus! iP = 0.6236521912165935
(extrinsic) keywd = suzuki
(extrinsic) keywd = maker
(extrinsic) keywd = resign
(intrinsic) keywd2 = tire
Input : k basis vectors
      : M'×N matrix data A'
      : keyword data keyword (N)
      : Number p of keywords representing cluster
      : Threshold δ for separating cluster
      : stopWord for use in cluster labeling
Output: labeled cluster set

(A', b, keyword, stopWord, p, δ, clusters) {
    keyword : String[N]; // holding the keywords
    stopWord : String[]; // holding the stop word list
    b : double[k][N]; // k N-dimensional basis vectors
    minValue, maxValue : double[M'][k]; // holding the maximum and minimum of inner product
    minIndex, maxIndex : integer[M'][k]; // holding the maximum and minimum index of inner product
    model : double[N]; // holding the i-th record of A'
    keywordIndex : integer[M'][p]; // holding the index corresponding to keyword
    innerProduct : double[M']; // holding the value of inner product
    index1, index2 : integer[M']; // holding the index of data
    maxModel : double; // holding the maximum value of data
    cluster1, cluster2 : variable-length integer array; // holding the cluster data
    label1, label2 : String; // label of cluster (for output)

    // step 1: initialization of various kinds of data
    cluster1 = cluster2 = null; // initialization of cluster
    label1 = label2 = null; // initialization of cluster label
    for (int r = 1; r <= k; r++) {
        for (int i = 1; i <= M'; i++) {
            for (int j = 1; j <= p; j++) // initialization of index
                keywordIndex[i][j] = -1;
            maxModel = 0.0;

            String line = read(i-th record of A');
            while (hasMoreTokens()) { // step 2: record processing for A'
                int q = getToken(line);
                model[q] = Double(getToken());
                if (model[q] > maxModel) {
                    keywordIndex[i][1] = q;
                    maxModel = model[q];
                    j = 1;
                    while (keywordIndex[i][j] != -1 and j < p) {
                        keywordIndex[i][j + 1] = [illegible];
                        j++;
                    }
                }
            }

            double t = t1 = t2 = 0.0;
            for (int j = 1; j <= N; j++) { // step 3: compute the similarity
                t1 = b[r][j] × model[j];
                t2 += model[j] × model[j];
                if (t1 > maxValue[i][r]) {
                    maxValue[i][r] = t1;
                    maxIndex[i][r] = j;
                }
                if (t1 < minValue[i][r]) {
                    minValue[i][r] = t1;
                    minIndex[i][r] = j;
                }
                t += t1;
            }
            copyArray(index1, index2); // index 2 to index 1
            t = t / √t2; // normalization
            innerProduct[i] = t; // inner product with basis vector
        }

        // step 4: sorting and deciding process of cluster
        SortIndexedDescend(M', index1, innerProduct); // sorting

        // cluster process of basis vector in positive direction
        i = 1;
        while (innerProduct[i] > δ and i <= M') {
            addCluster(cluster1, index1[i], keyword, p, maxIndex[i], keywordIndex[i]);
            i++;
        }

        // cluster process of basis vector in negative direction
        i = M';
        while (-innerProduct[i] > δ and i >= 1) {
            addCluster(cluster2, index1[i], keyword, p, minIndex[i], keywordIndex[i]);
            i--;
        }

        // step 5: cluster classification and labeling
        makeClusterLabel(r, stopWord, cluster1, cluster2, label1, label2);
        output(r, cluster1, cluster2, label1, label2);
    }
}
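The core of steps 3 and 4, scoring every document against one basis vector and cutting the sorted scores at the threshold in both directions, can be sketched in Python. This is an illustration only: it assumes documents are dense lists of keyword weights and that the basis vector has unit length, and the function names are hypothetical rather than taken from the listing.

```python
import math


def cluster_by_basis_vector(docs, b, delta):
    """Split documents into a positive and a negative cluster for basis vector b.

    Each score is the inner product of the document with b divided by the
    document's length. Documents scoring above delta join the positive
    cluster; documents scoring below -delta join the negative cluster.
    """
    scored = []
    for i, d in enumerate(docs):
        norm = math.sqrt(sum(x * x for x in d))
        t = sum(x * y for x, y in zip(d, b))
        scored.append((t / norm if norm > 0.0 else 0.0, i))
    scored.sort(reverse=True)  # descending by normalized inner product
    positive = [i for s, i in scored if s > delta]
    negative = [i for s, i in reversed(scored) if -s > delta]
    return positive, negative


def strongest_term(d, b):
    """Index of the keyword contributing most to the inner product (cf. maxIndex)."""
    products = [x * y for x, y in zip(d, b)]
    return max(range(len(products)), key=lambda j: products[j])
```

For example, with b = [1.0, 0.0] and delta = 0.5, a document [0.9, 0.1] lands in the positive cluster, [-1.0, 0.0] in the negative cluster, and [0.0, 1.0] in neither.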
basis vector, cluster label, # of docs, percent, type
1+, "Suzuki, samurai, car, sale, Japan", [illegible], [illegible], Noise
2+, "Nixon, China, library, Watergate, Beijing", 1342, 1.051, Major
3+, "China, Beijing, MCA, Chinese, Soviet", 2662, 2.084, Major
[illegible], "aspen, Colorado, Chicago, condominium, tent", 113, 0.088, Noise
5+, "hospital, patient, California, health, trauma", 16594, 12.990, Major
5+, "Hussein, administration, oil, Gulf, Jordan", [illegible], 0.786, Outlier
5+, "campaign, candidate, measure, private, change", 690, 0.540, Outlier
5+, "Iraq, Kuwait, Hussein, Iraqi, Bush", 3235, 2.532, Major
6+, "Israeli, Muslim, nation, Palestine, party", 240, 0.188, Outlier
6-, "transplant, veronica, liver, child, organ", 130, 0.102, Outlier
7+, "GM, plant, auto, Ford, car", 847, [illegible], Outlier
7-, "cruise, pole, cabin, coach, Caribbean", 86, 0.067, Noise
8+, "cruise, BPA, ship, apple, chemical", 777, 0.608, Outlier
9+, "bird, fish, wildlife, species, endanger", 219, 0.583, Outlier
[illegible], "school, Bush, administration, cook, child", 659, 0.516, Outlier
10+, "Beijing, China, Chinese, army, troop", 742, 0.581, Outlier
10+, "team, coach, school, league, UCLA", [illegible], [illegible], Major
10-, "Matsushita, company, Japanese, boycott, [illegible]", 116, 0.091, Noise
11+, "school, Amazon, Brazil, teacher, forest", 2273, 1.779, Major
12+, "team, coach, school, league, football", 8307, 6.503, Major
12-, "tax, council, Bush, port, Japan", 3148, 2.464, Major
13+, "duke, Louisiana, basketball, devil, campaign", 381, 0.298, Outlier
13-, "tax, school, budget, council, court", 167, 0.131, Outlier
14+, "school, teacher, child, education, class", 3057, 2.393, Major
14-, "police, officer, arrest, cocaine, car", 3187, 2.479, Major
15+, "Bush, art, school, acid, administration", [illegible], [illegible], [illegible]
15-, "museum, industry, collection, Japanese, house", [illegible], [illegible], Outlier
16+, "Bush, administration, president, congress, U.S.", 2515, 1.969, Major
16-, "art, museum, Florida, music, admission", 2737, 2.143, Major
17+, "team, coach, league, inning, point", 9661, 7.563, Major
17+, "campaign, election, commission, Asia, councilman", [illegible], [illegible], [illegible]
17-, "school, bus, court, teacher, education", 1119, 0.876, Outlier
18+, "Bush, art, museum, artist, house", 2871, 2.247, Major
18-, "company, price, loan, AIDS, market", 2072, 1.622, [illegible]
19+, "council, candidate, election, campaign, school", 4231, 3.312, Major
19-, "Bush, tax, art, budget, administration", 1272, 0.996, Outlier
20+, "race, ascot, midget, car, park", 166, 0.130, Outlier
20[sign illegible], "[illegible], Libya, Africa, Morocco, country", 1096, 0.858, Outlier
20-, "AIC, Kadafi, ability, ally, alliance", 155, 0.121, Outlier
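The percent column appears to be each cluster's share of the whole document collection. Under that reading (an assumption, since the listing does not define the column), several clearly legible rows should imply nearly the same collection size. The Python sketch below checks this; the helper names are hypothetical and the four (docs, percent) pairs are copied from the rows above.

```python
import statistics


def implied_total(docs, percent):
    """Collection size implied by one row, if percent = 100 * docs / total."""
    return docs * 100.0 / percent


def percent_of(docs, total):
    """Share of the collection, in percent, for a cluster of `docs` documents."""
    return 100.0 * docs / total


# (number of documents, percent) pairs read from legible rows of the table
legible_rows = [(16594, 12.990), (8307, 6.503), (9661, 7.563), (4231, 3.312)]

totals = [implied_total(d, p) for d, p in legible_rows]
estimated_total = statistics.median(totals)
```

The four estimates agree to within about ten documents, near 127,740, which supports the reading. The same relation can then be used to cross-check a row whose two numeric fields disagree, for example by comparing `percent_of(docs, estimated_total)` with the printed percent.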
Cluster name | Kind of cluster | Number of documents (%) | Set of keywords
M-1 | major | [illegible] | fruit, apple, orange, peach, banana, grape, lemon, melon, grapefruit, strawberry
M-2 | [illegible] | [illegible] | animal, cat, dog, pig, cow, sheep, tiger, lion, elephant, monkey
M-3 | [illegible] | [illegible] | tree, maple, oak, birch, chestnut, pine, cedar, acacia, cactus, cherry
M-4 | major | 4000 (4%) | sport, baseball, football, basketball, ski, marathon, swimming, jogging, sumo, tennis
M-5 | major | 4000 (4%) | computer, CPU, HDD, CD-ROM, DVD, LAN, FDD, modem, memory, PCMCIA
O-1 | outlier | 2000 (2%) | fish, salmon, carp, tuna, [illegible]
O-2 | outlier | 2000 (2%) | vegetable, tomato, cucumber, pumpkin, lettuce
O-3 | outlier | 2000 (2%) | insects, butterfly, ant, beetle, dragonfly
O-4 | outlier | 2000 (2%) | IVY, Princeton, Cornell, Harvard, Yale
O-5 | outlier | 2000 (2%) | disaster, typhoon, tornado, earthquake, thunderstorm
O-6 | outlier | 2000 (2%) | coffee, mocha, Blue Mountain, arabica, espresso
O-7 | outlier | 2000 (2%) | season, spring, summer, autumn, winter
O-8 | outlier | 2000 (2%) | musician, Beethoven, Bach, Chopin, Mozart
O-9 | outlier | 2000 (2%) | mountain, Fuji, Everest, Matterhorn, Kilimanjaro
O-10 | outlier | 2000 (2%) | jewel, diamond, gold, pearl, ruby
O-11 | outlier | 2000 (2%) | entrails, heart, liver, stomach, bowel
O-12 | outlier | 2000 (2%) | sense, eye, ear, mouth, nose
O-13 | outlier | 2000 (2%) | bird, pigeon, crow, sparrow, parrot
O-14 | outlier | 2000 (2%) | shoes, sandal, boots, spike, high heels
O-15 | outlier | 2000 (2%) | river, Mississippi, Nile, Huang, Rhine
O-16 | outlier | 2000 (2%) | language, English, Japanese, French, Chinese
O-17 | outlier | 2000 (2%) | president, Kennedy, Washington, Lincoln, Roosevelt
O-18 | outlier | 2000 (2%) | artist, Cezanne, Van Gogh, Picasso, Renoir
O-19 | outlier | 2000 (2%) | color, red, white, blue, green
O-20 | outlier | 2000 (2%) | furniture, bed, bookshelf, cupboard, sofa
Noise | noise | 40000 (below 1%) | 1850 distinct keywords (absence, ..., zirconium)
Cluster distribution (example 3)

Cluster distribution (example 4)
[Screenshot: Prosciutto window with File, Tools, and Help menus, displaying the cluster labels <police, digest, county, officer>, <Zurich, Swiss, London, Switzerland>, and <Bush, U.S., Soviet, Iraq>]